Multilingual data processing in the CELLAR environment

Author

  • Gary F. Simons
Abstract

    class Encoding has
        name : String
        description : String

    class CharacterSet based on Encoding has
        characters : sequence of CharacterDefn

    class LanguageEncoding based on Encoding has
        language : Language
        characterSet : refers to CharacterSet

Figure 6 summarizes the basic model in a diagram. It shows that every string stores a pointer to its encoding, which may be either a LanguageEncoding or a CharacterSet. If it is a LanguageEncoding, then that object includes an identification of the language and a pointer to the CharacterSet in which it is encoded. Note that many LanguageEncodings may use (that is, point to) the same CharacterSet.

5.2 Defining character sets and their characters

It is well known that the number of scripts (or writing systems) in the world is much lower than the number of languages, and that many languages thus use the same script. For instance, most of the languages of western Europe use a Roman script; the Slavic languages of eastern Europe use a Cyrillic script. On the other hand, a single language can be written with more than one script. A classic example is Serbo-Croatian: the Serbs write their language with a Cyrillic script, while the Croats write the same language with a Roman script. Character sets are the computational analog of scripts, and the same relationships to language hold true:

(15) Many languages may be encoded using the same character set.
(16) A single language may be encoded using multiple character sets.

In other words, there is a many-to-many relation between languages and character sets. Therefore Language and CharacterSet are modeled independently; the class LanguageEncoding is used to mediate the many-to-many relationship.

The conceptual model of CharacterSet, which is also diagrammed in figure 7, is as follows:

    class CharacterSet has
        name : String
        description : String
        codePoints : seq of CodePoint
        baseBeforeDiacritics : Boolean

Name is the name by which the character set is known throughout the system.
Description is documentation for users of the system; it gives a brief explanation of the purpose and source of the character set. CodePoints and baseBeforeDiacritics are described below.

Just as a script consists of a collection of graphic symbols (like letters of the alphabet), a character set consists of a collection of characters. But a character is not just a graphic symbol. Rather, it is an abstraction which associates a numerical code stored in the computer with a conventional graphic form. (The actual graphic forms are instantiated by fonts; see section 6.) This association of a numerical code with a graphic form we call a code point. But there is more to a character than this alone, because:

(17) When two languages are encoded with the same character set, it means that characters at the same code point have the same graphic form but not necessarily the same function.

For instance, whereas one language might use the colon (code 58) of the ASCII character set as a punctuation mark, another might use it to signal length of the preceding vowel. Or one language might use the apostrophe (code 39) as a quotation mark while another uses it for the glottal stop consonant. Thus:

(18) A fully specified character associates a code point (that is, a numerical code and a graphic form) with a function.

Therefore, our design formally distinguishes code points from characters. The codePoints attribute of CharacterSet is defined to store a sequence of CodePoints (see figure 7), and each CodePoint in turn stores a sequence of CharacterDefns. Each of these character definitions defines an association between its owning code point and the function it represents. When a LanguageEncoding declares which characters it uses, it selects these functionally specified CharacterDefns rather than CodePoints. Even though the CharacterDefns are ultimately motivated by the needs of LanguageEncodings, they are stored as part of the CharacterSet in order to maximize reuse of specifications.
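
The split between code points and character definitions can be illustrated with a minimal Python sketch. This is not CELLAR code: the dataclasses, the snake_case attribute names, the select helper, and the two language names are all assumptions made for illustration. It shows one code point (the colon, code 58) owning two definitions, with two language encodings choosing different ones from the same character set.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)
class CharacterDefn:
    """One function that a code point can serve (hypothetical stand-in)."""
    description: str


@dataclass(eq=False)
class CodePoint:
    name: str   # name of the graphic form, e.g. "colon"
    code: int   # numerical code stored in the computer
    functions: List[CharacterDefn] = field(default_factory=list)


@dataclass(eq=False)
class CharacterSet:
    name: str
    code_points: List[CodePoint] = field(default_factory=list)


@dataclass(eq=False)
class LanguageEncoding:
    name: str
    character_set: CharacterSet
    characters: List[CharacterDefn] = field(default_factory=list)

    def select(self, defn: CharacterDefn) -> None:
        # Hypothetical helper mirroring the "in" clause: only definitions
        # owned by this encoding's character set may be selected.
        owned = [d for cp in self.character_set.code_points for d in cp.functions]
        if not any(d is defn for d in owned):
            raise ValueError("definition is not owned by the character set")
        self.characters.append(defn)


# The character set owns both definitions of the colon exactly once.
punctuation = CharacterDefn("punctuation mark")
vowel_length = CharacterDefn("vowel length")
colon = CodePoint("colon", 58, [punctuation, vowel_length])
ascii_set = CharacterSet("ASCII", [colon])

# Two hypothetical languages select different functions for code 58.
lang_a = LanguageEncoding("Language A", ascii_set)
lang_b = LanguageEncoding("Language B", ascii_set)
lang_a.select(punctuation)
lang_b.select(vowel_length)
```

Both encodings point at the same character set and at definitions that the set owns, so nothing is duplicated per language.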
For instance, it is not desirable to define a new instance of colon as a punctuation mark for each of the hundreds of languages that can be encoded with a Roman-based character set like ANSI. Rather, the character set owns a single character definition and all language encodings point to the same character definition.

The formal definitions of CodePoint and CharacterDefn are as follows:

    class CodePoint has
        name : String
        code : Integer
        functions : seq of CharacterDefn

    abstract class CharacterDefn has
        description : String
        unicodeID : String

The name is a name for the graphic form associated with the CodePoint, such as "apostrophe" or "lowercase alpha." The description describes the function of the character if it is other than what would be assumed by the name of the code point. For instance, colon could be described as "vowel length" for the nonstandard use suggested above.

Another way to describe the function of a character is to declare the corresponding character in Unicode. For instance, the apostrophe of ASCII (code 39) could be functioning as "apostrophe" (Unicode 02BC), "opening single quotation mark" (Unicode 2018), or "closing single quotation mark" (Unicode 2019). Note that the functions attribute of a CodePoint is a sequence attribute so that such multiple functions for a single code point can be defined.

A third way to specify something about the function of a character is by means of the subclass of CharacterDefn that is used. CharacterDefn itself is an abstract class; an instance must use one of the subclasses. The subclasses are BaseCharacter, Diacritic, ControlCharacter, WhiteSpace, PrecedingPunctuation, InternalPunctuation, and FollowingPunctuation. These specifications allow the system to parse a string of characters into its meaningful parts, like the individual words; more on this below in section 5.4. Only one subclass adds any attributes, namely, BaseCharacter.
It adds a pointer to another BaseCharacter in the same character set that is the upper case equivalent of this one:

    class BaseCharacter based on CharacterDefn has
        upperCaseEquivalent : refers to BaseCharacter
            in functions of codePoints of owner of my owner
        lowerCaseEquivalent : seq of BaseCharacter
            means refsFromUpperCaseEquivalentOfBaseCharacter

There is no need to set the lower case equivalent. It is computed as a virtual attribute that queries the backreference which is automatically set when upperCaseEquivalent is set. It allows multiple values since, for instance, both a and á might declare A as their upper case equivalent.

A final note concerns the baseBeforeDiacritics attribute of CharacterSet. This stores a Boolean value that declares whether the diacritic characters are encoded before or after their base characters. For instance, if true, then a`e means àe; otherwise, it means aè. By putting this attribute on CharacterSet we are saying that even if two character sets have identical code points, they are still different character sets if one encodes diacritics after their bases while the other puts them before. We consider them different character sets because the encoded strings have different meanings. They are also likely to require different fonts to handle the different discipline for placing the diacritic on its base.

5.3 Defining languages and their encodings

As noted above in point (16), a single language may be encoded with multiple character sets. Figure 8 shows how this is modeled. The Language object has an attribute named encodings which may store multiple LanguageEncodings. Each LanguageEncoding stores a pointer to the CharacterSet it uses. The full conceptual model of Language is as follows:

    class Language has
        name : String
        XXXcode : String
        ISOcode : String
        encodings : seq of LanguageEncoding
        preferredEncoding : refers to LanguageEncoding in my encodings

Name is the display name of the language.
XXXcode is the three-letter code from the Ethnologue (Grimes 1992) which uniquely identifies this language from among the 6,000 plus languages of the world. ISOcode is the three-letter code from ISO 639 (ISO 1991); the current committee draft of this standard specifies unique codes for 404 of the world's major languages. PreferredEncoding points to one of the encodings for this language; it is the encoding that the system will use if it is asked to create a string in this language and no specific encoding is requested.

The complete conceptual model for LanguageEncoding is as follows:

    class LanguageEncoding has
        name : String
        description : String
        language : Language means my owner
        characterSet : refers to CharacterSet
            in characterSets of owner of my owner
        characters : refers to seq of CharacterDefn
            in functions of codePoints of my characterSet
        multigraphs : seq of MultigraphDefn

The first two attributes give a name to the encoding and a brief description of its purpose or source. Figure 6 shows that LanguageEncoding has an attribute to specify its language. Here we see that this is a virtual attribute; the language is not actually stored here but is retrieved by executing the query: my owner. CharacterSet indicates which character set is used for this method of encoding the language. It is a pointer to one of the character sets that is defined in the system Configuration (that is, owner of my owner; see figure 5).

The characters attribute defines the characters that are used in this encoding of the language. It does so by pointing to the appropriate CharacterDefns; note that the in clause of the attribute specification limits the selection to only those CharacterDefns that are defined in the character set used by the encoding. By pointing to a selection of the possible CharacterDefns, this attribute does two things. First, it allows the designer of the encoding to omit (and thus disallow) code points that are not used for encoding this language.
Second, as noted above in point (17), the same code point can be used for different functions in encoding different languages. This attribute allows the designer of the encoding to select the particular CharacterDefn (and thus the particular function) that is desired for a particular code point.

The final attribute of a LanguageEncoding declares its multigraphs. These are sequences of characters which for some purposes (such as sorting or breaking into sound units) should be considered as a single unit; for instance, in German the trigraph sch represents a single sound (the grooved alveopalatal fricative) and the digraph ng represents another single sound (the velar nasal). A MultigraphDefn has the following conceptual model:

    class MultigraphDefn has
        name : String
        description : String
        characters : refers to seq of CharacterDefn
            in characters of my owner

That is, it has a name and a description (as would a single character). It also specifies the sequence of characters that form the multigraph by pointing to the CharacterDefns for the characters. Note that in this case, the in clause is even more restricted than it was for characters of the LanguageEncoding. Here the possible values are limited to the characters that are defined in my owner (that is, the LanguageEncoding).

5.4 Using encoding information to tokenize strings

One kind of processing that is typically done on text strings is to break them down into meaningful subcomponents, or tokens. For some purposes one might want to view the string as a sequence of characters, for others as a sequence of sound units (where multigraphs are treated as single tokens), and for still others one might want to treat the string as a sequence of wordforms and punctuation marks. The kinds of information stored in CharacterSets and LanguageEncodings make it possible for CELLAR to perform such tokenization of strings using built-in functions.
They are as follows:

characterTokenize
    Returns a sequence of CharacterDefns for the code points in the String.

baseCharacterTokenize
    Returns a sequence of Strings, each of which is either a base character with its associated diacritics or a single character of any other type.
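
The behavior described for these functions can be sketched in Python as follows. This is a rough illustration under stated assumptions, not CELLAR's implementation: the KINDS lookup table, the snake_case function names, and the handling of a diacritic that has no base are all hypothetical. The third function, multigraph_tokenize, is not named in the text above; it illustrates the sound-unit view of a string and assumes greedy longest-match, which the source does not specify.

```python
from typing import Dict, Iterable, List, Tuple

BASE, DIACRITIC, OTHER = "BaseCharacter", "Diacritic", "Other"

# Hypothetical lookup from a character to the kind of its CharacterDefn.
KINDS: Dict[str, str] = {"a": BASE, "e": BASE, "`": DIACRITIC}


def character_tokenize(text: str, kinds: Dict[str, str]) -> List[Tuple[str, str]]:
    """Pair every character with the kind of its definition."""
    return [(ch, kinds.get(ch, OTHER)) for ch in text]


def base_character_tokenize(
    text: str, kinds: Dict[str, str], base_before_diacritics: bool
) -> List[str]:
    """Group each base character with its diacritics; other characters stand alone."""
    tokens: List[str] = []
    if base_before_diacritics:
        cluster_open = False  # True while the last token is a base awaiting diacritics
        for ch in text:
            kind = kinds.get(ch, OTHER)
            if kind == DIACRITIC and cluster_open:
                tokens[-1] += ch
            else:
                tokens.append(ch)
                cluster_open = kind == BASE
        return tokens
    pending = ""  # diacritics seen so far that still wait for their base
    for ch in text:
        kind = kinds.get(ch, OTHER)
        if kind == DIACRITIC:
            pending += ch
        elif kind == BASE:
            tokens.append(pending + ch)
            pending = ""
        else:
            tokens.extend(pending)  # assumption: stranded diacritics become single tokens
            tokens.append(ch)
            pending = ""
    tokens.extend(pending)
    return tokens


def multigraph_tokenize(text: str, multigraphs: Iterable[str]) -> List[str]:
    """Split text into units, preferring the longest multigraph at each position."""
    ordered = sorted((m for m in multigraphs if m), key=len, reverse=True)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        match = next((m for m in ordered if text.startswith(m, i)), text[i])
        tokens.append(match)
        i += len(match)
    return tokens


print(base_character_tokenize("a`e", KINDS, True))     # ['a`', 'e']
print(base_character_tokenize("a`e", KINDS, False))    # ['a', '`e']
print(multigraph_tokenize("schlange", ["sch", "ng"]))  # ['sch', 'l', 'a', 'ng', 'e']
```

The two calls on "a`e" reproduce the baseBeforeDiacritics example from section 5.2: the same stored string groups as a-grave followed by e in one character set and as a followed by e-grave in the other.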

Publication date: 1995